Homework #2 - 🕸 Network analysis of PID¶

📚 Data¶

Our dataset will be the open data of Prague Integrated Transport (PID). Specifically, we will work with timetable data, which is originally in the GTFS (General Transit Feed Specification) format. This format is consumed by a wide range of software applications, which is why public transit agencies, including PID, use it to publish their data.

☝️ Note! For this homework we provide a ready-made dataset $D$ in csv format. We assembled dataset $D$ for you from the timetable data. More information about all the files and their attributes can be found in the GTFS format documentation.

Assignment¶

☝️ Use markdown cells! Justify all important steps, describe the visualizations and what can be observed in them. No points will be awarded for an unreadable homework.

Up to 4 points can be earned for thorough running commentary and visual presentation of your approach and results. You work on the assignment individually.

✨ Dataset

  • Load the data from the file d.csv, which is the aforementioned dataset $D$; it contains a record for every two consecutive stops of some service.

  • The structure is as follows (for simplicity we do not consider the service start_date and end_date):

    | stop_from | stop_from_name | stop_to | stop_to_name | depart_from | arrive_to | route_type | is_night | mon | tue | wed | thu | fri | sat | sun |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | U699Z3P | Stadion Strahov | U981Z1P | Koleje Strahov | 7:24:00 | 7:25:00 | 3 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |

  • You may treat the stop name as the unique identifier of a stop. Beware of identical stop names for different means of transport - for such stops you may assume they are one and the same station (in many cases this actually holds).

⚙️ Data preprocessing

  • You will most likely want to adjust the depart_from and arrive_to attributes: they contain time values longer than 24 hours (1.5 % of the records in total). From the GTFS reference (describing the column from which depart_from and arrive_to originate): Arrival time at a specific stop for a specific trip on a route. … For times occurring after midnight on the service day, enter the time as a value greater than 24:00:00 in HH:MM:SS local time for the day on which the trip schedule begins. However, you will also encounter erroneous times that start with a value greater than 24. All these cases can be handled with modulo 24 ☝️.
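The modulo-24 fix can be sketched as follows (a minimal sketch; the helper name is illustrative, not part of the provided dataset):

```python
def normalize_time(hms: str) -> str:
    """Wrap hours >= 24 back into the 0-23 range (GTFS after-midnight times)."""
    h, m, s = hms.split(":")
    return f"{int(h) % 24}:{m}:{s}"

print(normalize_time("25:10:00"))  # -> "1:10:00"
print(normalize_time("7:24:00"))   # -> "7:24:00" (valid times pass through)
```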

🕸️ Basic network analysis (12 points total)

The task is to analyze the network of PID stops. Stops will be the nodes of the network. There is a directed edge between two stops if they are two consecutive stops of some service (i.e. there is at least one record in dataset $D$ with the corresponding stop_from, stop_to). The weight of an edge equals the number of vehicles that pass along the given segment during one week.
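The weekly edge weight defined above can be sketched as follows (toy data standing in for dataset $D$, which has one row per consecutive stop pair of each trip):

```python
import pandas as pd

# Hypothetical mini-version of dataset D: one row per consecutive stop pair
# of a trip, with one boolean flag per weekday.
df = pd.DataFrame({
    "stop_from_name": ["A", "A", "B"],
    "stop_to_name":   ["B", "B", "C"],
    "mon": [1, 1, 0], "tue": [1, 0, 0], "wed": [1, 0, 0],
    "thu": [1, 0, 0], "fri": [1, 0, 0], "sat": [0, 1, 1], "sun": [0, 1, 1],
})

days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
# Weekly weight of an edge = number of vehicles over one week, i.e. the sum
# of active-day flags over all records for that stop pair.
weights = (
    df.assign(weekly=df[days].sum(axis=1))
      .groupby(["stop_from_name", "stop_to_name"])["weekly"].sum()
)
# A->B: 5 + 3 = 8 vehicles per week, B->C: 2 vehicles per week
```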

Proceed as follows:

  • Start by choosing any package for network analysis and visualization (the aforementioned NetworkX may be used),
  • from dataset $D$ create a data representation suitable for your chosen visualization package,
  • create a visualization of the network (4 points total) - the visualization must be readable, so you may visualize only a part of the network (e.g. for a particular means of transport, such as trams (visualizing only the funicular or metro network is not accepted), or some subgraph - the choice is up to you),
  • using at least three centrality measures, analyze the importance of stops over one week (Monday - Sunday) and comment in words on what these measures mean with respect to the actual data (2 points per measure, 6 points total),
  • create a visualization for at least one centrality measure (2 points total).

❓ Your own questions (3 points per question, max 9 points total)

Come up with 3 questions (more are welcome) based on filtering dataset $D$ and answer them with suitable visualizations. Questions for inspiration:

  • Do the important stops change depending on day/night operation?
  • Is there a difference in stop utilization between the working week and the weekend?
  • ...

🔥 Extra data

The file stops.txt lists the latitude and longitude of each stop. You can use these data to extend your analysis, and they can also help with laying out the graph. ☝️ Beware of identical stop names with slightly different locations for different means of transport. You need to devise a solution (e.g. take the first, the mean, the centroid of the polygon, etc., at your discretion) and justify it.

Scoring¶

Summary of the maximum points you can earn:

  • 4 points for running commentary and visual presentation of your approach and results,
  • 4 points for the network visualization,
  • 6 points for the commented analysis of at least 3 centrality measures,
  • 2 points for the visualization of one of the centrality measures,
  • 9 points for defining and answering at least three of your own questions.

In total, homework #2 is worth up to 25 points.

Homework¶

Importing libraries & reading files¶

In [1]:
pip install plotly networkx pandas matplotlib scipy
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import scipy.stats as stats
import plotly.graph_objects as go
import plotly.express as px
In [3]:
d = pd.read_csv('d.csv', dtype={
    'monday':bool, 'tuesday':bool, 'wednesday':bool, 'thursday':bool, 'friday':bool, 'saturday':bool, 'sunday':bool, 'is_night':bool, 
    'route_type':'uint8'
})

d.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1642433 entries, 0 to 1642432
Data columns (total 15 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   stop_from       1642433 non-null  object
 1   stop_from_name  1642433 non-null  object
 2   stop_to         1642433 non-null  object
 3   stop_to_name    1642433 non-null  object
 4   depart_from     1642433 non-null  object
 5   arrive_to       1642433 non-null  object
 6   route_type      1642433 non-null  uint8 
 7   is_night        1642433 non-null  bool  
 8   monday          1642433 non-null  bool  
 9   tuesday         1642433 non-null  bool  
 10  wednesday       1642433 non-null  bool  
 11  thursday        1642433 non-null  bool  
 12  friday          1642433 non-null  bool  
 13  saturday        1642433 non-null  bool  
 14  sunday          1642433 non-null  bool  
dtypes: bool(8), object(6), uint8(1)
memory usage: 89.3+ MB
In [4]:
stops = pd.read_csv('stops.txt')
stops.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16435 entries, 0 to 16434
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   stop_id              16435 non-null  object 
 1   stop_name            15936 non-null  object 
 2   stop_lat             16435 non-null  float64
 3   stop_lon             16435 non-null  float64
 4   zone_id              15408 non-null  object 
 5   stop_url             0 non-null      float64
 6   location_type        16435 non-null  int64  
 7   parent_station       954 non-null    object 
 8   wheelchair_boarding  16435 non-null  int64  
 9   level_id             954 non-null    object 
 10  platform_code        14750 non-null  object 
 11  asw_node_id          15748 non-null  float64
 12  asw_stop_id          15354 non-null  float64
dtypes: float64(5), int64(2), object(6)
memory usage: 1.6+ MB

Preprocessing¶

D dataset¶

In [5]:
import unicodedata

def remove_diacritics(x):
    '''
        Taken from here: https://stackoverflow.com/questions/49891778/conversion-utf-to-ascii-in-python-with-pandas-dataframe
    '''
    if x is np.nan or x is None:
        return x
    return unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode() 

d.head()
Out[5]:
stop_from stop_from_name stop_to stop_to_name depart_from arrive_to route_type is_night monday tuesday wednesday thursday friday saturday sunday
0 U2991Z301 Hněvice T58005 Hněvice seř.n. 4:53:00 4:54:30 2 False True True True True True False False
1 T58005 Hněvice seř.n. U4610Z301 Záluží 4:54:30 4:56:00 2 False True True True True True False False
2 U4610Z301 Záluží U4609Z301 Dobříň 4:56:00 4:59:00 2 False True True True True True False False
3 U4609Z301 Dobříň U4608Z301 Roudnice nad Labem 4:59:00 5:03:00 2 False True True True True True False False
4 U4608Z301 Roudnice nad Labem U4609Z301 Dobříň 4:36:00 4:38:00 2 False True True True True True False False

We will transform the "depart_from" and "arrive_to" features into datetime, because they will be easier to work with that way.

In [6]:
def trnsm(s):
    # wrap hours >= 24 back into the 0-23 range (GTFS after-midnight times)
    l = s.split(':')
    l[0] = str(int(l[0]) % 24)
    return ':'.join(l)

d['depart_from'] = pd.to_datetime(d['depart_from'].apply(trnsm), format='%X')
d['arrive_to'] = pd.to_datetime(d['arrive_to'].apply(trnsm), format='%X')
d.head()
Out[6]:
stop_from stop_from_name stop_to stop_to_name depart_from arrive_to route_type is_night monday tuesday wednesday thursday friday saturday sunday
0 U2991Z301 Hněvice T58005 Hněvice seř.n. 1900-01-01 04:53:00 1900-01-01 04:54:30 2 False True True True True True False False
1 T58005 Hněvice seř.n. U4610Z301 Záluží 1900-01-01 04:54:30 1900-01-01 04:56:00 2 False True True True True True False False
2 U4610Z301 Záluží U4609Z301 Dobříň 1900-01-01 04:56:00 1900-01-01 04:59:00 2 False True True True True True False False
3 U4609Z301 Dobříň U4608Z301 Roudnice nad Labem 1900-01-01 04:59:00 1900-01-01 05:03:00 2 False True True True True True False False
4 U4608Z301 Roudnice nad Labem U4609Z301 Dobříň 1900-01-01 04:36:00 1900-01-01 04:38:00 2 False True True True True True False False

Now we will calculate the route duration from one stop to the next; it can later be transformed into a meaningful edge weight.

We will also compute the number of days each connection is active, for use in further computations.

Finally, we will remove diacritics from stop names to unify their spelling and simplify the code, and drop unnecessary columns.

In [7]:
ser = d['arrive_to'] - d['depart_from']

ser[d['arrive_to'] < d['depart_from']] += pd.Timedelta("1 day")

d['route_duration'] = ser

d['days_active'] = d[['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']].sum(axis=1)

d['stop_from_name'] = d['stop_from_name'].apply(remove_diacritics).str.lower()
d['stop_to_name'] = d['stop_to_name'].apply(remove_diacritics).str.lower()

d = d.drop(columns=['stop_to', 'stop_from'])

d.head(3)
Out[7]:
stop_from_name stop_to_name depart_from arrive_to route_type is_night monday tuesday wednesday thursday friday saturday sunday route_duration days_active
0 hnevice hnevice ser.n. 1900-01-01 04:53:00 1900-01-01 04:54:30 2 False True True True True True False False 0 days 00:01:30 5
1 hnevice ser.n. zaluzi 1900-01-01 04:54:30 1900-01-01 04:56:00 2 False True True True True True False False 0 days 00:01:30 5
2 zaluzi dobrin 1900-01-01 04:56:00 1900-01-01 04:59:00 2 False True True True True True False False 0 days 00:03:00 5

I found the description of route types in the GTFS documentation, so let's convert this column to a categorical type.

In [8]:
route_type = pd.CategoricalDtype([
    'Tram', 'Subway', 'Rail', 'Bus', 'Ferry', 'Cable Tram', 'Air Lift', 'Funicular'
])

d['route_type'] = pd.Categorical.from_codes(d['route_type'], dtype=route_type)
display(np.sort(d['route_type'].unique()))
d.head()
array(['Bus', 'Ferry', 'Funicular', 'Rail', 'Subway', 'Tram'],
      dtype=object)
Out[8]:
stop_from_name stop_to_name depart_from arrive_to route_type is_night monday tuesday wednesday thursday friday saturday sunday route_duration days_active
0 hnevice hnevice ser.n. 1900-01-01 04:53:00 1900-01-01 04:54:30 Rail False True True True True True False False 0 days 00:01:30 5
1 hnevice ser.n. zaluzi 1900-01-01 04:54:30 1900-01-01 04:56:00 Rail False True True True True True False False 0 days 00:01:30 5
2 zaluzi dobrin 1900-01-01 04:56:00 1900-01-01 04:59:00 Rail False True True True True True False False 0 days 00:03:00 5
3 dobrin roudnice nad labem 1900-01-01 04:59:00 1900-01-01 05:03:00 Rail False True True True True True False False 0 days 00:04:00 5
4 roudnice nad labem dobrin 1900-01-01 04:36:00 1900-01-01 04:38:00 Rail False True True True True True False False 0 days 00:02:00 5

Stops dataset¶

In [9]:
stops.head()
Out[9]:
stop_id stop_name stop_lat stop_lon zone_id stop_url location_type parent_station wheelchair_boarding level_id platform_code asw_node_id asw_stop_id
0 U50S1 Budějovická 50.044411 14.448787 P NaN 1 NaN 1 NaN NaN 50.0 NaN
1 U52S1 Chodov 50.031672 14.490961 P NaN 1 NaN 1 NaN NaN 52.0 NaN
2 U75S1 Kolbenova 50.110395 14.516398 P NaN 1 NaN 1 NaN NaN 75.0 NaN
3 U78S1 Ládví 50.126591 14.469451 P NaN 1 NaN 1 NaN NaN 78.0 NaN
4 U100S1 Vltavská 50.100298 14.438492 P NaN 1 NaN 1 NaN NaN 100.0 NaN

We will unify zones P, B and 0, remove diacritics from the stop names, and lowercase them.

In [10]:
stops['zone_id'] = stops['zone_id'].str.extract(r'(.)$', expand=False)\
                                .replace({'P' : '0', 'B' : '0'})\
                                .str.extract(r'(\d+)', expand=False)\
                                .astype('UInt8')

stops['stop_name_diacritics'] = stops['stop_name']
stops['stop_name'] = stops['stop_name'].apply(remove_diacritics).str.lower()

Next, we will aggregate all platforms into stations: we take the mean coordinates of the platforms and the mode of the zones (for stations whose stops lie in different zones). We will also account for wheelchair accessibility.

In [11]:
stations = stops.groupby('stop_name').agg({
    'stop_lat' : 'mean',
    'stop_lon' : 'mean',
    'zone_id' : lambda x: stats.mode(x)[0],
    'stop_name_diacritics' : lambda x: list(x)[0],
    'wheelchair_boarding' : 'any'
}).reset_index()
stations.head()
Out[11]:
stop_name stop_lat stop_lon zone_id stop_name_diacritics wheelchair_boarding
0 ahr km 11,485 50.146550 14.731470 <NA> AHr Km 11,485 False
1 albertov 50.067917 14.420798 0 Albertov True
2 ametystova 49.988201 14.362216 0 Ametystová False
3 amforova 50.041778 14.327298 0 Amforová False
4 andel 50.071132 14.403406 0 Anděl True
In [12]:
stations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7542 entries, 0 to 7541
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   stop_name             7542 non-null   object 
 1   stop_lat              7542 non-null   float64
 2   stop_lon              7542 non-null   float64
 3   zone_id               6835 non-null   UInt8  
 4   stop_name_diacritics  7542 non-null   object 
 5   wheelchair_boarding   7542 non-null   bool   
dtypes: UInt8(1), bool(1), float64(2), object(2)
memory usage: 257.9+ KB

Visualizing the network¶

Gathering graph¶

We will select only those routes on which trams operate during the daytime (i.e. not night routes).

In [13]:
routes = d[(d['route_type'] == 'Tram') & ~d['is_night']]
print(f"Number of samples {len(routes)}")
routes.head()
Number of samples 557785
Out[13]:
stop_from_name stop_to_name depart_from arrive_to route_type is_night monday tuesday wednesday thursday friday saturday sunday route_duration days_active
35575 vozovna motol motol 1900-01-01 06:07:00 1900-01-01 06:08:00 Tram False True True True True True False False 0 days 00:01:00 5
35576 motol krematorium motol 1900-01-01 06:08:00 1900-01-01 06:09:00 Tram False True True True True True False False 0 days 00:01:00 5
35577 krematorium motol hlusickova 1900-01-01 06:09:00 1900-01-01 06:10:00 Tram False True True True True True False False 0 days 00:01:00 5
35578 hlusickova slanska 1900-01-01 06:10:00 1900-01-01 06:12:00 Tram False True True True True True False False 0 days 00:02:00 5
35579 slanska blatiny 1900-01-01 06:12:00 1900-01-01 06:13:00 Tram False True True True True True False False 0 days 00:01:00 5

Now we will aggregate over "stop_from_name" and "stop_to_name" to gather the individual edges, then transform the resulting dataframe into a list of edges.

In [14]:
edges = routes.groupby(['stop_from_name', 'stop_to_name',], as_index=False).size().drop(columns=['size'])
print(f"Number of edges: {edges.shape[0]}")
edges[:5]
Number of edges: 657
Out[14]:
stop_from_name stop_to_name
0 albertov botanicka zahrada
1 albertov ostrcilovo namesti
2 albertov vyton
3 andel arbesovo namesti
4 andel bertramka

Now we will select only the stations that trams stop at; with those stops extracted, the network will be easier to visualize.

In [15]:
tram_stations_set = set(routes['stop_from_name']) | set(routes['stop_to_name'])
tram_stations = stations[stations['stop_name'].isin(tram_stations_set)]
print(f"Number of nodes: {tram_stations.shape[0]}")
tram_stations.head()
Number of nodes: 281
Out[15]:
stop_name stop_lat stop_lon zone_id stop_name_diacritics wheelchair_boarding
1 albertov 50.067917 14.420798 0 Albertov True
4 andel 50.071132 14.403406 0 Anděl True
7 arbesovo namesti 50.075978 14.404773 0 Arbesovo náměstí True
9 arena liben jih 50.102892 14.494592 0 Arena Libeň jih True
42 balabenka 50.104022 14.482587 0 Balabenka True

Plotting network¶

We will use plotly for plotting: with almost 300 nodes, no static image could convey much information about this network, and plotly provides an interactive plotting tool. Now we will collect the coordinates for the go.Scattergeo object.

In [16]:
edge_lon, edge_lat = [], []

def update_edges_coords(x, upd_dicts=(edge_lon, edge_lat)):
    edge_lon, edge_lat = upd_dicts
    from_edge = tram_stations[tram_stations['stop_name'] == x['stop_from_name']].iloc[0]
    to_edge = tram_stations[tram_stations['stop_name'] == x['stop_to_name']].iloc[0]
    edge_lat += [from_edge['stop_lat'], to_edge['stop_lat'], None]
    edge_lon += [from_edge['stop_lon'], to_edge['stop_lon'], None]

edges.apply(update_edges_coords, axis=1)
edge_lat[:6]
Out[16]:
[50.067917, 50.070775999999995, None, 50.067917, 50.0652065, None]

We will display the Vltava river to make it easier to orient in the plot. It is approximated by a polyline of points that we manually picked on a map.

In [17]:
vltava_lat = [ 50.14248, 50.1365, 50.12886, 50.11769, 50.11295, 50.11438, 50.11141, 50.11376, 50.10585, 50.09891, 50.09429, 50.09269, 50.08641, 50.07576, 50.0617, 50.05501, 50.02558, 49.99921, 49.99193, 49.98531, 49.97197 ]
vltava_lon = [14.39662, 14.39214, 14.40063, 14.39651, 14.40767, 14.41857, 14.42964, 14.44538, 14.46029, 14.45462, 14.42458, 14.416, 14.41102, 14.41019, 14.41487, 14.41506, 14.39678, 14.40228, 14.39077, 14.37155, 14.34549 ]
In [18]:
from plotly.graph_objs.scattergeo import Marker, Line

fig = go.Figure()


fig.add_trace(
    go.Scattergeo(
        name='Vltava',
        lat=vltava_lat,
        lon=vltava_lon,
        mode='lines',
        hoverinfo='skip',

        line=Line(
            color='#7ACCE1',
            width=8
        )
    )
)

fig.add_trace(
    go.Scattergeo(
        name='Tram routes',
        lat=edge_lat,
        lon=edge_lon,
        mode='lines',
        hoverinfo='skip',

        line=Line(
            color='#65151D'
        )
    )
)

fig.add_trace(
    go.Scattergeo(
        name='Stations',
        lat=tram_stations['stop_lat'].values,
        lon=tram_stations['stop_lon'].values,
        text=tram_stations['stop_name_diacritics'].values,
        mode='markers',
        hoverinfo='text',

        marker=Marker(
            color='#e6e3e3',
            size=8,
            line_width=2,
        )
    )
)

fig.update_geos(
    fitbounds='locations',
    scope='europe',
    visible=False
)

fig.update_layout(
    height=1200,
    width=1200, 
    margin={"r":0,"t":0,"l":0,"b":0},
    legend=dict(
        yanchor="top",
        y=0.96,
        xanchor="left",
        x=0.01
    )
)
fig.show()

Centrality measures¶

Constructing networkx graphs¶

We will construct both a directed multigraph, where each edge represents one concrete scheduled connection, and an ordinary directed graph, where an edge represents the existence of such a connection.

In [19]:
MG = nx.MultiDiGraph()

MG.add_edges_from(d[['stop_from_name', 'stop_to_name']].values)

print(f"Multigraph has {len(MG.nodes())} nodes")
print(f"Multigraph has {len(MG.edges())} edges")
Multigraph has 7391 nodes
Multigraph has 1642433 edges
In [20]:
G = nx.DiGraph()

G.add_edges_from(
    d.groupby(['stop_from_name', 'stop_to_name']).size().reset_index().drop(columns=[0]).values
)

print(f"Graph has {len(G.nodes())} nodes")
print(f"Graph has {len(G.edges())} edges")
Graph has 7391 nodes
Graph has 18465 edges

Calculating centralities¶

Degree centrality¶

Degree centrality assigns each node an importance based on its degree. It makes sense to calculate it on our multigraph: many incoming and outgoing edges of a node reflect high service frequency, so busier stations receive a higher score.

NetworkX normalizes every degree by $|V(G)| - 1$, where $|V(G)|$ is the number of vertices in the graph; we will convert the scores back to whole numbers.
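The denormalization on a toy multigraph (hypothetical data; this is just a sanity check that multiplying by $|V(G)| - 1$ recovers raw degrees):

```python
import networkx as nx

# Toy multigraph: parallel edges are counted separately, which is exactly
# why degree centrality on the multigraph reflects service frequency
# rather than pure topology.
MG = nx.MultiDiGraph()
MG.add_edges_from([("a", "b"), ("a", "b"), ("b", "c")])

# Undo the 1/(|V|-1) normalization applied by networkx.
raw = {v: round(c * (len(MG) - 1)) for v, c in nx.degree_centrality(MG).items()}
print(raw)  # {'a': 2, 'b': 3, 'c': 1}
```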

In [21]:
deg_cent = pd.DataFrame.from_dict(
    nx.degree_centrality(MG), orient='index'
).rename(columns={0 : 'score'}).sort_values('score', ascending=False) * (len(MG) - 1.0)
deg_cent['score'] = deg_cent['score'].astype(np.uint32)
display(deg_cent.describe().T)
print("Top 10 stations with the most neighboring connections")
deg_cent.head(10).T
count mean std min 25% 50% 75% max
score 7391.0 444.441348 1217.950157 1.0 40.0 84.0 245.5 17206.0
Top 10 stations with the most neighboring connections
Out[21]:
andel karlovo namesti i. p. pavlova palmovka lihovar malostranska narodni trida narodni divadlo ohrada zelivskeho
score 17206 16043 12613 12515 12256 11889 11698 11488 11451 11274

We can see that the most central stations are so-called transport hubs, where many means of transport meet. They also happen to be very busy.

Betweenness centrality¶

Betweenness centrality counts the number of shortest paths going through a given node. The underlying assumption is that the graph has a "radial" shape, so it is faster to travel through the center to reach other nodes.

We need to compute the shortest paths from every node to every other node reachable from it, which is done by running a BFS from every node, giving $O(n^2 + nm)$ complexity, where $n$ is the number of nodes and $m$ the number of edges. That is at least $7391^2 + 7391 \cdot 18465 \approx 1.9 \cdot 10^8$ operations. This estimate does not account for our choice of language (Python), so the real cost is much larger. In short, the exact computation would take forever, so we approximate it by using only $k$ randomly chosen nodes as BFS sources.

In [22]:
btwn_cent = pd.DataFrame.from_dict(
    nx.betweenness_centrality(G, k=1000, seed=42), orient='index'
).rename(columns={0 : 'score'}).sort_values('score', ascending=False)
display(btwn_cent.describe().T)
print("Top 10 stations with the highest betweenness centrality score")
btwn_cent.head(10).T
count mean std min 25% 50% 75% max
score 7391.0 0.002611 0.006875 0.0 0.00019 0.000824 0.002396 0.16603
Top 10 stations with the highest betweenness centrality score
Out[22]:
roztyly lihovar cerny most chrastany ostredek zlicin nemocnice krc lhotka hlavni nadrazi i. p. pavlova
score 0.16603 0.136086 0.12791 0.119674 0.106884 0.106022 0.099827 0.098467 0.089731 0.082423

This metric emphasizes the "easy to travel through" aspect of centrality. Because the network includes every station PID operates, the metric favors stations with more connections to the outer parts of the network (higher zone_id values).

Eigenvector centrality¶

Eigenvector centrality estimates the importance of a node through the recursive assumption that important nodes link to other important nodes. This leads to a system of equations that can be written as a matrix-vector equation, whose solution is an eigenvector of the adjacency matrix. It can be found iteratively using the power method.
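The power method can be sketched on a toy graph (a minimal sketch, assuming a connected undirected graph; the shift by the identity is a standard trick to keep the iteration from oscillating between the $\pm$ leading eigenvalues on bipartite-like graphs):

```python
import numpy as np
import networkx as nx

# Toy graph: a triangle 0-1-2 plus a pendant node 3 attached to node 2.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])
A = nx.to_numpy_array(G)

x = np.ones(len(G))
for _ in range(500):
    x = (A + np.eye(len(G))) @ x  # one power-method step on A + I
    x /= np.linalg.norm(x)        # renormalize to avoid overflow

# Node 2 sits in the triangle and also holds the pendant node,
# so it receives the highest score.
print(x.argmax())  # -> 2
```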

In [23]:
ev_cent = pd.DataFrame.from_dict(
    nx.eigenvector_centrality(G, max_iter=1000), orient='index'
).rename(columns={0 : 'score'}).sort_values('score', ascending=False)
display(ev_cent.describe().T)
print("Top 10 stations with the highest eigenvector centrality score")
ev_cent.head(10).T
count mean std min 25% 50% 75% max
score 7391.0 0.001331 0.011556 3.174914e-99 2.463414e-09 4.606858e-07 0.000017 0.363673
Top 10 stations with the highest eigenvector centrality score
Out[23]:
cerny most turnov,terminal u zel.st. mlada boleslav,aut.st. mlada boleslav,boleslavska mlada boleslav,zaluzany skoda mlada boleslav,jicinska mnichovo hradiste,zdrav.str. mlada boleslav,novy hrbitov svijany,na najezdu kosmonosy,transcentrum
score 0.363673 0.31347 0.304207 0.289093 0.213959 0.186643 0.184919 0.169564 0.155258 0.151255

This metric gives a relative score, so it ranks higher those stations that have a bridge-like quality. There is a big bus terminal at "Černý most", and several others in the top ten are also bus stations of various kinds.

Plotting degree centrality with network¶

We will plot the tram network with the degree centrality scores overlaid. It is not the most precise way to present a centrality, but it is one of the best we can do.

In [24]:
tram_stations_score = pd.merge(deg_cent.reset_index(), tram_stations, left_on='index', right_on='stop_name')
tram_stations_score.head()
Out[24]:
index score stop_name stop_lat stop_lon zone_id stop_name_diacritics wheelchair_boarding
0 andel 17206 andel 50.071132 14.403406 0 Anděl True
1 karlovo namesti 16043 karlovo namesti 50.075203 14.418322 0 Karlovo náměstí True
2 i. p. pavlova 12613 i. p. pavlova 50.075074 14.430867 0 I. P. Pavlova True
3 palmovka 12515 palmovka 50.103996 14.474715 0 Palmovka True
4 lihovar 12256 lihovar 50.050508 14.409909 0 Lihovar True
In [25]:
from plotly.graph_objs.scattergeo import Marker, Line

fig = go.Figure()


fig.add_trace(
    go.Scattergeo(
        name='Vltava',
        lat=vltava_lat,
        lon=vltava_lon,
        mode='lines',
        hoverinfo='skip',

        line=Line(
            color='#7ACCE1',
            width=8
        )
    )
)

fig.add_trace(
    go.Scattergeo(
        name='Tram routes',
        lat=edge_lat,
        lon=edge_lon,
        mode='lines',
        hoverinfo='skip',

        line=Line(
            color='#65151D'
        )
    )
)

fig.add_trace(
    go.Scattergeo(
        name='Stations',
        lat=tram_stations_score['stop_lat'],
        lon=tram_stations_score['stop_lon'],
        text=tram_stations_score['stop_name_diacritics'] + ' (' + tram_stations_score['score'].astype(str) + ')',
        mode='markers',
        hoverinfo='text',

        marker=Marker(
            color=tram_stations_score['score'],
            colorscale='thermal',
            size=12,
            line_width=1,
            colorbar=dict(
                title="Number of connections coming and leaving station",
                len=0.5
            )
        )
    )
)

fig.update_geos(
    fitbounds='locations',
    scope='europe',
    visible=False
)

fig.update_layout(
    height=1200,
    width=1200, 
    margin={"r":0,"t":0,"l":0,"b":0},
    legend=dict(
        yanchor="top",
        y=0.96,
        xanchor="left",
        x=0.01
    )
)
fig.show()

Own questions¶

Betweenness centrality for weekends routes between 5:00 and 6:00¶

There is less traffic on weekends and in the early morning, though PID still needs to cover most of the stations. Hypothetically this leads to a less centralized network, so we will compare the betweenness centrality of the whole network with the betweenness centrality of the weekend early-morning network.

In [26]:
d_wm = d[
    d['saturday'] & 
    d['sunday'] & 
    (pd.to_datetime('05:00:00', format='%X') <= d['depart_from']) &
    (d['arrive_to'] <= pd.to_datetime('06:00:00', format='%X'))
]
d_wm.head()
Out[26]:
stop_from_name stop_to_name depart_from arrive_to route_type is_night monday tuesday wednesday thursday friday saturday sunday route_duration days_active
8 hnevice hnevice ser.n. 1900-01-01 05:53:00 1900-01-01 05:54:30 Rail False True True True True True True True 0 days 00:01:30 7
9 hnevice ser.n. zaluzi 1900-01-01 05:54:30 1900-01-01 05:56:00 Rail False True True True True True True True 0 days 00:01:30 7
10 zaluzi dobrin 1900-01-01 05:56:00 1900-01-01 05:59:00 Rail False True True True True True True True 0 days 00:03:00 7
15 hnevice ser.n. hnevice 1900-01-01 05:02:00 1900-01-01 05:03:00 Rail False False False False False False True True 0 days 00:01:00 2
28 roudnice nad labem dobrin 1900-01-01 05:54:00 1900-01-01 05:56:30 Rail False True True True True True True True 0 days 00:02:30 7
In [27]:
MG_wm = nx.MultiDiGraph()
MG_wm.add_edges_from(d_wm[['stop_from_name', 'stop_to_name']].values)

print(f"Multigraph has {len(MG_wm.nodes())} nodes")
print(f"Multigraph has {len(MG_wm.edges())} edges")
Multigraph has 2812 nodes
Multigraph has 16474 edges
In [28]:
btwn_cent_wm = pd.DataFrame.from_dict(
    nx.betweenness_centrality(MG_wm, k=1000, seed=42), orient='index'
).rename(columns={0 : 'score'}).sort_values('score', ascending=False)
print('#' * 10, ' WEEKEND MORNING ', '#' * 10)
display(btwn_cent_wm.describe().T)
print("Top 10 stations with the highest betweenness centrality score of the weekend early-morning network.")
display(btwn_cent_wm.head(10).T)

print('#' * 10, ' ALLWEEK ', '#' * 10)
display(btwn_cent.describe().T)
print("Top 10 stations with the highest betweenness centrality score of the allweek network")
display(btwn_cent.head(10).T)
##########  WEEKEND MORNING  ##########
count mean std min 25% 50% 75% max
score 2812.0 0.005246 0.013047 0.0 0.000134 0.001368 0.004378 0.159034
Top 10 stations with the highest betweenness centrality score of the weekend early-morning network.
zlicin florenc hlavni nadrazi u hangaru lihovar k letisti smichovske nadrazi terminal 1 muzeum vltavska
score 0.159034 0.132443 0.124987 0.124706 0.121288 0.112645 0.111882 0.097551 0.096744 0.096244
##########  ALLWEEK  ##########
count mean std min 25% 50% 75% max
score 7391.0 0.002611 0.006875 0.0 0.00019 0.000824 0.002396 0.16603
Top 10 stations with the highest betweenness centrality score of the all-week network
roztyly lihovar cerny most chrastany ostredek zlicin nemocnice krc lhotka hlavni nadrazi i. p. pavlova
score 0.16603 0.136086 0.12791 0.119674 0.106884 0.106022 0.099827 0.098467 0.089731 0.082423

Can a single mode of transport cover all stops in the city of Prague (zone 0)?¶

We could have done this by selecting only the edges whose route type matches the chosen mode of transport and then calling nx.is_connected(G), but we will take a different approach.

We will count the total number of stops in Prague and compare it to the number of stations reachable from "Hlavní nádraží" by the chosen mode of transport. If a mode covered the whole network, the number of reachable stations would equal the number of stations in Prague.

This approach also gives us a measure of coverage density for each mode of transport.
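For completeness, the connectivity-based alternative mentioned above can be sketched on toy data; in the notebook, the real edges and station set would come from d_prague and prg_stations_set.

```python
import networkx as nx

# Build a graph from the edges of a single route type, add every station as a
# node (so stations without that service stay in the graph as isolated nodes),
# and ask whether the result is a single connected component.
stations = {'a', 'b', 'c', 'd'}
tram_edges = [('a', 'b'), ('b', 'c')]   # toy data: station 'd' has no tram service

G_toy = nx.Graph()
G_toy.add_nodes_from(stations)          # isolated stations remain in the graph
G_toy.add_edges_from(tram_edges)

print(nx.is_connected(G_toy))           # False: 'd' is unreachable by tram
```

Adding all stations as nodes first is the important step: without it, uncovered stations would simply be missing from the graph and the connectivity check would pass vacuously.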

In [29]:
prague_stations = stations[stations['zone_id'] == 0]
prg_stations_set = set(prague_stations['stop_name'])
N = prague_stations.shape[0]
print(f"Number of stations in Prague: {N}")
prague_stations.head(2)
Number of stations in Prague: 1594
Out[29]:
stop_name stop_lat stop_lon zone_id stop_name_diacritics wheelchair_boarding
1 albertov 50.067917 14.420798 0 Albertov True
2 ametystova 49.988201 14.362216 0 Ametystová False
In [30]:
G_prague = nx.Graph()
G_prague.add_nodes_from(prague_stations['stop_name'].values)

d_prague = d[d['stop_from_name'].isin(prg_stations_set) & d['stop_to_name'].isin(prg_stations_set)]

covering_mean_of_trns = []

for mean_of_transport in d['route_type'].unique():
    mt_edges = d_prague.loc[d_prague['route_type'] == mean_of_transport, ['stop_from_name', 'stop_to_name']].values
    G_prague.add_edges_from(mt_edges)
    n_reachable = len(nx.node_connected_component(G_prague, 'hlavni nadrazi'))
    print(f"By {mean_of_transport:<10} it is possible to reach {n_reachable} of {N} stations in Prague")
    if n_reachable == N:
        covering_mean_of_trns.append(mean_of_transport)
    G_prague.clear_edges()

print()
if not covering_mean_of_trns:
    print("No mode of transport covers the whole zero zone")
else:
    print(f"The zero zone can be covered (separately) by: {covering_mean_of_trns}")
By Rail       it is possible to reach 1 of 1594 stations in Prague
By Bus        it is possible to reach 1276 of 1594 stations in Prague
By Tram       it is possible to reach 281 of 1594 stations in Prague
By Ferry      it is possible to reach 1 of 1594 stations in Prague
By Funicular  it is possible to reach 1 of 1594 stations in Prague
By Subway     it is possible to reach 58 of 1594 stations in Prague

No mode of transport covers the whole zero zone

What are the longest shortest paths on trams, ignoring waiting time?¶

This is a question from the real world. Every few months my family comes to Prague; my mother and I love the city and long walks through it, but my little brother gets tired very quickly, so we often surrender to "tram-walks", travelling across Prague by tram. The longer the ride, the better, since my brother gets more time to rest.

Anecdotes aside, the metric that corresponds to the longest journey is the diameter of a graph: the longest shortest path between any two nodes. NetworkX returns the length of such a path when nx.diameter(G) is called. To retrieve the path itself we will use the periphery method, which returns the list of nodes whose eccentricity (the maximum shortest-path distance from the given node) equals the diameter. Both of these methods can be generalised to weighted graphs.
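As a quick sanity check of these definitions, here is a minimal toy example on a path graph (not the PID data):

```python
import networkx as nx

# Path graph 0-1-2-3: the shortest path between the two end nodes has length 3,
# which is the diameter; the periphery is exactly those two end nodes.
G_toy = nx.path_graph(4)

print(nx.eccentricity(G_toy))   # {0: 3, 1: 2, 2: 2, 3: 3}
print(nx.diameter(G_toy))       # 3
print(nx.periphery(G_toy))      # [0, 3]
```

The diameter equals the maximum eccentricity over all nodes, and periphery collects the nodes attaining it, which is precisely what we exploit for the tram network below.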

We will take the mean of all durations that the dataset $D$ reports for each pair of stations. A few records have a duration of zero; we will drop those.

Because Prague trams run in both directions between every pair of adjacent stations, we will represent the network as an undirected graph.

In [31]:
tram_routes = d[d['route_type'] == 'Tram'].groupby(['stop_from_name', 'stop_to_name']).agg(
    {'route_duration' : 'mean'}
).reset_index()
tram_routes.loc[:, ['route_duration_seconds']] = tram_routes['route_duration'].dt.seconds
tram_edges = tram_routes[tram_routes['route_duration_seconds'] != 0].drop(columns=['route_duration'])

Here is an example of such an instant route.

In [32]:
tram_routes[tram_routes['route_duration'].dt.seconds == 0].head(2)
Out[32]:
stop_from_name stop_to_name route_duration route_duration_seconds
237 levskeho sidliste modrany 0 days 0
318 nademlejnska u elektry 0 days 0
In [33]:
d.loc[d['stop_from_name'] == 'levskeho', ['stop_from_name', 'stop_to_name', 'depart_from', 'arrive_to']].tail(2)
Out[33]:
stop_from_name stop_to_name depart_from arrive_to
1523365 levskeho sidliste modrany 1900-01-01 04:16:00 1900-01-01 04:16:00
1523409 levskeho sidliste modrany 1900-01-01 04:16:00 1900-01-01 04:16:00

Now we can construct the graph and calculate diameter and find periphery nodes.

In [34]:
G_tram = nx.Graph()
G_tram.add_weighted_edges_from(
    tram_edges.values
)
periphery_nodes = nx.periphery(G_tram, weight='weight')
diameter = nx.diameter(G_tram, weight='weight')
print(f"Diameter is: {diameter}")
periphery_nodes
Diameter is: 3553
Out[34]:
['lehovec', 'sidliste modrany']

Let's find the shortest path between found stations.

In [35]:
path_nodes = nx.shortest_path(G_tram, periphery_nodes[0], periphery_nodes[1])
path_nodes_set = set(path_nodes)

We will distinguish the path by colour, so we need to create new datasets that we can turn into a network Plotly can visualise.

In [36]:
tram_stations.loc[: , ['is_path']] = False
tram_stations.loc[tram_stations['stop_name'].isin(path_nodes_set), ['is_path']] = True

tram_edges.loc[:, ['is_path']] = False
tram_edges.loc[tram_edges['stop_from_name'].isin(path_nodes_set) & tram_edges['stop_to_name'].isin(path_nodes_set), ['is_path']] = True
tram_edges.head()
Out[36]:
stop_from_name stop_to_name route_duration_seconds is_path
0 albertov botanicka zahrada 60 False
1 albertov ostrcilovo namesti 60 False
2 albertov vyton 107 False
3 andel arbesovo namesti 86 False
4 andel bertramka 127 False

Finally, we can convert all of them into lists of coordinates suitable for Plotly.

In [37]:
tram_edge_lat = []
tram_edge_lon = []

path_edge_lat = []
path_edge_lon = []

def update_tram_edges_coords(x):
    global tram_edge_lat, tram_edge_lon, path_edge_lat, path_edge_lon
    
    from_edge = tram_stations[tram_stations['stop_name'] == x['stop_from_name']].iloc[0]
    to_edge = tram_stations[tram_stations['stop_name'] == x['stop_to_name']].iloc[0]
    
    if x['is_path']:
        path_edge_lat += [from_edge['stop_lat'], to_edge['stop_lat'], None]
        path_edge_lon += [from_edge['stop_lon'], to_edge['stop_lon'], None]
    else:
        tram_edge_lat += [from_edge['stop_lat'], to_edge['stop_lat'], None]
        tram_edge_lon += [from_edge['stop_lon'], to_edge['stop_lon'], None]
        

tram_edges.apply(update_tram_edges_coords, axis=1)
path_edge_lat[:6]
Out[37]:
[50.102892, 50.1020335, None, 50.102892, 50.103544, None]

We will use code similar to what we used for the simple network, except that we will plot the path separately.

In [38]:
fig = go.Figure()


fig.add_trace(
    go.Scattergeo(
        name='Vltava',
        lat=vltava_lat,
        lon=vltava_lon,
        mode='lines',
        hoverinfo='skip',

        line=Line(
            color='#7ACCE1',
            width=8
        )
    )
)

fig.add_trace(
    go.Scattergeo(
        name='Tram routes',
        lat=tram_edge_lat,
        lon=tram_edge_lon,
        mode='lines+markers',
        hoverinfo='skip',

        line=Line(
            color='#ccbaba'
        )
    )
)


fig.add_trace(
    go.Scattergeo(
        name=f'Path. Total time: {diameter // 60}m {diameter % 60}s',
        lat=path_edge_lat,
        lon=path_edge_lon,
        mode='lines+markers',
        hoverinfo='skip',

        line=Line(
            color='#F5B700',
            width=8
        ),

        marker=Marker(
            color='#F5B700',
            size=8,
            line_width=2,
        )
    )
)

tram_stations_path = tram_stations[tram_stations['is_path']] 

fig.add_trace(
    go.Scattergeo(
        lat=tram_stations_path['stop_lat'].values,
        lon=tram_stations_path['stop_lon'].values,
        text=tram_stations_path['stop_name_diacritics'].values,
        mode='markers',
        hoverinfo='text',
        showlegend=False,

        marker=Marker(
            color='#F5B700',
            size=8,
            line_width=2,
        )
    )
)

fig.update_geos(
    fitbounds='locations',
    scope='europe',
    visible=False
)

fig.update_layout(
    height=1200,
    width=1200, 
    margin={"r":0,"t":0,"l":0,"b":0},
    legend=dict(
        yanchor="top",
        y=0.96,
        xanchor="left",
        x=0.01
    )
)
fig.show()

References¶

Pandas documentation

NetworkX documentation

Plotly documentation

StackOverflow

Removing diacritics
